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ABSTRACT 

Signals  and  tine  series  often  are  not  homogeneous  but  rather  are 
generated  by  nechanisas  or  processes  with  various  phases.  Siailarly, 
images  are  not  hoaogeneous  but  contain  various  objects.  "Segmentation" 
is  a  process  of  attempting  to  recover  automatically  the  phases  or 
objects.  A  model  for  representing  such  signals,  time  series,  and  images 
is  discussed.  Some  approaches  to  estimation  and  segmentation  in  this 
model  are  presented. 


Key  words  and  phrases:  statistical  pattern  recognition, 
classification;  temporal  correlation,  spatial  correlation;  optimization 
by  relaxation  method. 
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1 .  Introduction 

About  ten  years  ago,  Professor  Jaaes  Osterburg,  a  colleague  at  the 
university  of  Illinois  at  Chicago  and  an  expert  on  physical  evidence 
(Osterburg  and  O'Hara,  1949;  Osterburg,  1968,  1982),  consulted  ae  in 
regard  to  establishing  probability  estimates  for  partial  fingerprints. 
Ve  have  had  a  very  interesting  collaboration  which  resulted  in  several 
papers  (Osterburg,  Parthasarathy,  Raghavan,  and  Sclove,  1979;  Sclove, 
1979,  1980,  1981a)  and  say,  thanks  to  Professor  Osterburg's  continuing 
efforts,  result  in  changes  in  practices  regarding  the  evaluation  of 
fingerprints  as  evidence.  For  the  present  paper  the  point  about 
fingerprints  is  this:  When  at  professional  Beatings  I  would  talk  about 
the  subject,  certain  people  (usually  electrical  engineers  or  coaputer 
scientists)  would  tell  ae  that  what  I  was  doing  was  "laage  processing." 

The  reason  that  our  fingerprint  work  reseables  laage  processing  is 
this.  Osterburg  treated  fingerprint  analysis  by  placing  a  grid  of  cells 
over  the  print.  One  then  notes  the  locations  of  any  occurrences  of  the 
"Gal ton  details,"  the  alnutiae  of  the  ridge  lines,  such  as  ridge  endings 
and  forks.  In  nuaerical  laage  processing  a  real  laage  is  divided  into 
cells  ("pixels":  picture  elements)  and  one  notes  numerically  what 
occurs  in  each  cell.  The  real  laage  is  expressed  as  a  matrix  —  rows  by 
coluans  —  of  cells,  just  as  the  TV  screen  has  a  matrix  of  dots  which 
are  llluainated  with  various  colors  and  intensities. 
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2.  I sages 

An  important  aspect  of  Image  processing  is  image  "segmentation," 
the  location  of  objects  in  the  iaage. 

Examples.  (1)  A  picture  of  a  house  and  yard  is  to  be  labeled  using 
the  labels,  brick,  glass,  tree,  grass,  sky.  (ii)  An  image  is  to  be 
labeled  using  the  labels,  tank,  mud,  tree,  sky.  (iii)  A  medical  iaage 
is  to  be  labeled  using  the  labels,  tumor,  or  normal  tissue. 

Segmentation  involves  labeling  each  pixel  according  to  the  name  of 
the  object  of  which  it  is  a  part,  in  turn  Involves  the  grouping  of 
neighboring  pixels.  Since  I  had  earlier  worked  on  cluster  analysis,  the 
grouping  of  observations  (Sclove  1977),  it  was  natural  to  concentrate  on 
segmentation.  The  labeling  process  can  be  made  explicit  in  a  model 
which  states  that  at  each  pixel  we  observe  the  value  of  a  random 
variable  X,  but  also  along  with  X  there  is  an  unobservable  variable,  the 
label.  (I  tend  to  treat  the  labels  as  parameters.  Others  treat  them  as 
missing  data,  i.e.,  random  variables.)  In  the  context  of  this  model, 
segmentation  is  merely  estimation  of  the  labeling  parameters. 

The  random  variable  X  is  often  a  vector  of  several  measurements. 

Examples.  (1)  A  familiar  example  of  a  vector  of  measurements 
(though  it  would  not  be  measured  across  a  two-dimensional  array)  is 
blood  pressure,  which  is  a  vector  of  the  two  measurements,  systolic  and 
diastolic.  (li)  In  color  television,  X  is  a  vector  of  three 
measurements,  the  red  level,  green  level,  and  blue  level,  (ill)  In 
Landsat  satellite  data,  there  are  four  spectral  channels,  one  in  the 
green/yellow  visible  range,  a  second  in  the  red  visible  range,  and  the 
other  two  in  the  near  infrared  range,  (iv)  For  a  black-and-white  image, 
the  random  variable  X  which  is  simply  a  scalar  consisting  of  a  single 
gray- level  measurement ,  rather  than  a  vector. 

Images  are  two-dimensional ;  I  decided  first  to  consider  a 
one-dimensional  version  of  the  problem.  An  image  is  a  two-way  series;  a 
one-way  series  is  a  "time  series."  So  I  began  this  research  by  thinking 


about  segmentation  of  time  aeries. 
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3.  Tiae  Series 

The  problea  of  segaentation  considered  here  is:  Given  a  tiae 
series 

{x(t),  t=1 ,2,  ,n}, 

partition  the  set  of  values  of  t  into  subseries  (segaents,  regiaes)  for 
which  the  values  x(t)  are  relatively  hoaogeneous  in  soae  sense.  The 
segaents  are  assuaed  to  fall  into  several  classes. 

Examples.  (i)  Segment  a  received  signal  into  background,  target, 
background  again,  another  target,  etc.  (ii)  Segaent  an  EEC  of  a 
sleeping  persion  into  periods  of  deep  sleep  and  restless  or  fitful  sleep 
(two  classes  of  segaent).  (iii)  Segment  an  EGG  into  rhythmic  and 
arhythaic  periods  (two  classes  of  segaent).  (iv)  Segment  an  econoaic 
tiae  series  into  periods  of  recession,  recovery,  and  expansion  (three 
classes  of  segment). 


Next,  several  simulated  tiae  series  will  be  shown.  The  first  is 
relatively  smooth;  the  second,  relatively  rough  or  juapy.  (They  are 
simulated  first-order  autoregressions  with  autoregression  coefficients 
equal  to  +.8  and  -.8,  respectively.) 
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Smooth  tiae  series  (simulated  first-order 
autoregression  with  coefficient  +.8) 
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The  next  series  has  alternating  smooth  and  jumpy  segments. 
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It  is  a  simulated  series,  with  three  segments.  Each  segment  is  a 
first-order  autoregression.  For  t  s  1  to  20  the  coefficient  is  +.8;  for 
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t  s  21  to  30  the  coefficient  is  -.8;  then  for  t  s  31  to  50  the 
coefficient  is  again  +.8.  In  actual  data  analysis  one  would  not  know 
the  segsents  but  would  have  to  find  then  and  estimate  their  parameters. 


Mow  let  us  consider  a  real  exaaple,  rather  than  simulated  data.  The 
graph  below  is  quarterly  (DIP  (Gross  National  Product)  for  55  quarters, 
starting  with  the  first  quarter  of  1946. 

LOG  GNP 

2.700  +  45 

123 

34567890 

89012 

01  67 

2.550  +  789  2345 

123456 

0 

9 

123  78 

2.400  ♦  890  456 

67 

-  345 

-  12 

10  20  30  40  50  60 

Plot  of  Y(T) ,  where  Y(T)  is  log  (base  ten)  quarterly  (DIP  in 
billions  of  current  (unadjusted)  dollars.  T=1  is  1946-1,  T=2 
is  1946-2,  ...,  Ts5  is  1947-1,  etc.  (N=55) 


Here  Is  a  second  difference  of  the  log  quarterly  GNP.  Do  you  see 
anything  particularly  interesting? 
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Plot  of  Mixed  second  difference  of  log  GNP: 
Vertical  scale,  labeled  '10000104',  is  Z(T),  where 
Z(T)  =  1000C (Y(T)-Y(T-I ) )  -  <Y(T-4)-Y(T-5))]  and 
Y(T)  =  log  QJP  (base  10);  T=1  is  1947-2,  T=2  is 
1947-3,  ....  T=9  is  1949-2,  etc.  (N*50) 


A  first  difference  corresponds  to  a  velocity.  The  difference 

y(t)  -  y(t-1)  =  d(t), 

say,  is  Cy(t)  -  y( t— 1 ) ]/C t— C t— 1 ) ] ,  which  is  the  change  in  y  over  the 
one  tine  unit  froa  tlae  t-1  to  tlae  t.  A  second  difference 

d(t)  -  d(t— 1 ) , 

which  in  terns  of  an  original  series  of  y's  is 

{[y(t)-y(t-1)3  -  Cy(t-1)-y(t-2)]}, 

is  proportional  to  the  change  in  velocity  across  the  indicated  tine 
periods  and  hence  is  essentially  an  acceleration.  Thus  a  second 
difference  is  perhaps  a  natural  transforn  to  analyze.  (Note,  however, 
that  the  second  difference  used  here  is  a  nixed  second  difference, 
naaely,  the  ordinary  difference  of  the  lag- four  difference 
y(t)-y(t-4).l  Ths  second-difference  series  appears  level.  Uhat  I  think 
is  particularly  interesting  is  that  the  acceleration  of  the  econoay  due 
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IMAGES 


,  AND 


to  the  Korean  conflict  of  the  1950 's  is  readily  apparent.  (T  =  12  is 
1950-1,  T  =  13  is  1950-2,  etc.)  I  think  that  the  need  for  segmentation 
is  clearly  indicated.  One  needs  to  give  some  sort  of  special  treatment 
to  those  four  exceedingly  high  values. 

An  alternative  approach  to  such  observations  is  to  identify  them  as 
"outliers."  However,  "outlier"  connotes  spuriousness.  If  outlying 
observations  are  non-spurious  or  are  associated  with  a  recurring  cause, 
perhaps  they  should  not  then  be  termed  "outliers."  They  should  be 
modeled. 

In  other  cases,  where,  e.g.,  equipment  failure  is  suspected,  one 
truly  wants  to  look  for  outliers.  Here,  too,  segmentation  can  be 
useful. 

De  Alba  and  Zartman  (1979)  analyzed  radiotelemetrlc  measurements  of 
cows'  temperatures,  in  order  to  locate  the  periods  of  high  estrus,  with 
a  view  toward  more  optimally  timed  breeding  and  efficient  milk 
production.  A  technique  of  de  Alba  and  Van  Ryzin  (1979,  1980)  was 
used.  In  the  de  Alba-Zartman  report  the  analysis  of  the  temperatures  of 
one  cow  over  133  days  is  discussed  in  detail.  Eleven  observations  were 
detected  as  coming  from  distributions  with  means  shifted  relative  to  the 
rest  of  the  observations.  Inspection  of  the  data  showed  that  some  of 
these  temperatures  were  high  and  some  extremely  low.  It  is  concluded 
that  the  high  temperature  readings  correspond  to  times  of  active  estrus. 
(Perhaps  the  low  readings  correspond  to  instrument  failure  and  hence  are 
true  "outliers"  in  that  sense.) 

Here,  as  in  de  Alba  and  Zartman  (1979),  the  raw  data  were 
pre-processed  by  differencing.  Model-selection  criteria  (see  below) 
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estimated  the  number,  k,  of  classes  of  segment  as  two,  but  k  =  3  and 
k  =  6  scored  almost  as  well.  Since  the  numbers  analyzed  were 
differences,  it  is  perhaps  especially  interesting  to  consider  three 
classes,  particularly  since  the  segmentation  yielded  one  class  for 
positive  differences,  one  for  negative  differences,  and  one  for 
differences  close  to  zero.  The  results  obtained  are  similar  to  those 
obtained  by  de  Alba  and  Zartman.  They  identified  eleven  observations  as 
"outliers;"  7  of  these  were  high  and  4  were  low.  The  segmentation 
algorithm  run  with  three  classes  of  segment  identified  4  observations  as 
high;  these  were  among  the  7  identified  by  de  Alba  and  Zartman.  Use  of 
six  classes  provided  even  closer  agreement  to  the  de  Alba-Zartman 
results.  The  upper  two  of  the  six  classes  captured  6  of  the  7  high 
observations;  the  4  low  observations  were  located  by  the  lowest  class. 

(In  discussion  after  the  paper  Professor  Parzen  mentioned  that  at 
Texas  A&M  University  also  they  had  dealt  with  the  analysis  of  bovine 
estrus  and  found  by  spectral  analysis  that  a  filtered  Poisson 
process  —  see,  e.g.,  Parzen  (1962)  —  provided  a  good  fit  to  the  data.) 

Now,  having  discussed  some  examples,  let  me  be  a  little  more 
specific  about  the  model  and  the  algorithm.  [More  formal  presentations 
are  found  in  Sclove  ( 1983a, c, d). ]  The  elements  of  the  segmentation 
model  are  the  class-conditional  time  series  or  distributions,  with  their 
j  parameters;  the  labels;  and  the  transition  probabilities  between  the 

labels.  Correspondingly,  the  algorithm  alternates  between  estimation  of 
the  distributional  parameters,  estimation  of  the  labels,  and  estimation 
I  of  the  transition  probabilities.  That  is,  given  a  tentative  labeling, 

! 
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one  can  obtain  tentative  estiaates  of  the  parameters  of  the 
class-condi tional  distributions  and  of  the  transition  probabilities. 
One  then  relabels  the  observations,  using  these  updated  paraaeter 
estimates.  The  relabeling  is  done  as  follows:  If  x(t)  (i.e.,  if  tine 
period  t)  is  currently  labeled  as  class  c,  then  x(t+1)  is  labeled  as 
that  class  d  for  which  the  product 

p(c  to  d)f(x(t+1)|d) 

is  aaxiaal,  where  p(c  to  d)  denotes  the  current  estimate  of  the 
probability  of  a  transition  from  c  to  d  and  f(x|d)  denotes  the 
tentatively  estimated  class-d  probability  density,  evaluated  at  x. 
This  makes  sense  because,  under  the  assumptions  of  the  model,  the 
likelihood  is  the  product  of  these  terms. 

To  illustrate  the  algorithm,  let  us  consider  a  short,  artificial 
time  series. 


t:  1  2  3  4  5  6  7  8  9  10  11  12 

x(t):  113121267111 

Suppose  it  is  specified  that  there  are  two  classes  and  that  the 
class-conditional  distributions  are  exponential.  Suppose  the  initial 
guesses  of  the  parameters  are  equal  prior  probabilities  for  the  two 
classes  and  means  of  2  and  3*  Then  initially  the  class-conditional 
densities  are  taken  as  f(x|a)  =  (1/2)exp(-x/2)  and  f(xlb) 
=  (1/3)exp(-x/3).  In  the  first  Iteration,  using  equal  prior 

probabilities  of  .5,  one  labels  x  as  having  come  from  Class  A  if 
.5f(xla)  >  .5f(xlb),  which  simplifies  to  x  <  2.43*  This  gives  the 
following  estimated  labels  at  the  end  of  the  first  Iteration. 
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to 


t:  1  2  3  4  5  6  7  8  9  10  11  12 

x(t):  113121267111 

label:  aabaaaabbaaa 

Mom  the  transition  probabilities  can  be  estiaated  froa  the  sequence  of 
estiaated  labels.  The  11  transitions  are:  a  to  a,  a  to  b,  b  to  a, 
a  to  a,  a  to  a,  a  to  a,  a  to  b,  b  to  b,  b  to  a,  a  to  a,  a  to  a.  These 
give  the  folloMing  estiaates  of  the  transition  probabilities. 

p(a  to  a)  s  3/4  p(a  to  b)  *  1/4 

p(b  to  a)  a  2/3  p(b  to  b)  3  1/3 

The  class  neans  are  estiaated  as  follows  (but  see  Section  6:  Plans  for 
Further  Research,  for  a  discussion  of  this  point). 

Mean  of  a's:  (1  +  1  +  1+  2+1+2+1  +  1  +  1)/9  3  11/9  *  1.22 
Mean  of  b’s:  (3  ♦  6  ♦  7)/3  *  16/3  *  5.33 
Mow  the  condition  for  labeling  the  current  x  as  "a",  given  that  the 
preceding  x  has  been  labeled  as  "a",  is 

p(a  to  b)f(xlb)  <  p(a  to  a)f(xla), 
or 

(1/4)(1/5.33)exp(-x/5.33)  <  (3/4)(1/1.22)exp(-x/1.22), 
which  siapllfles  to  x  <  4.075.  Slallarly,  the  condition  for  labeling  x 
as  "a",  given  that  the  preceding  observation  has  been  labeled  as  "bN,  is 
p(b  to  b)f(xlb)  <  p(b  to  a)f(xla),  which  siapllfles  to  x  <  1.24. 
These  second-iteration  classification  rules  change  only  the  label  of 
x(3)  froa  "b"  to  "a". 

At  this  point  let  ae  aention  a  possible  iaproveaent  to  the 
algorltha.  At  the  relabeling  stage,  where  the  labels  are  re-estiaated, 
based  on  tentative  paraaeter  estiaates  and  transition  probabilities,  the 
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problea  is  one  of  estlsating  a  finite  sequence,  given  the  transition 
probabilities.  The  dynaaic  prograaaing  approach  to  this  problea  is 
known  as  the  Viterbi  algor itha.  [Forney  (1971)  is  perhaps  the  aost 
readable  reference  on  this;  the  references  Viterbi  (1967,  1971)  and 
Viterbi  and  Odenwalder  (1969)  are  also  given.]  This  approach  will  be 
applied  in  the  future.  The  present  approach  is  perhaps  soaewhat  slapler 
but  has  Its  advantages.  It  adapts  itself  to  operation  in  the  purely 
sequential  aode,  and  it  is  relatively  easy  to  prograa. 

An  algorltha,  such  as  the  present  one,  which  alternates  between 
optimizing  different  sets  of  variables,  is  known  as  a  "relaxation" 
aethod  (Southwell  1940,  1946;  Ortega  and  Rheinboldt  1970).  In  the 
present  case  the  different  sets  of  variables  are  different  paraaeters, 
naaely,  the  distributional  paraaeters,  the  transition  probabilities,  and 
the  labels. 

He  note  in  passing  that  the  EM  algorltha  —  see  Deapster,  Laird  and 
Rubin  (1977)  —  is  such  a  relaxation  aethod,  where  at  the  "E"  step  the 
estlaation  is  by  expectation  and  at  the  "M"  step  the  estimation  is  by 
maximization. 

An  alternative  approach  to  the  presently-implemented  relaxation 
aethod  would  Involve  equating  partial  derivatives  of  the  likelihood 
function  to  zero  and  solving  the  resulting  equations.  It  is  clear  that 
results  analogous  to  those  of  Holfe  (1970)  will  be  obtained  by  this 
approach,  with  transition  probabilities  replacing  his  mixture 
probabilities.  In  any  case,  the  resulting  equations  have  to  be  solved 
by  an  iterative,  nuaerical  aethod,  and  it  is  not  clear  whether  it  would 
be  any  better  than  the  the  presently-iapleaented  relaxation  aethod. 
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4.  Image-segmentation  experiments 

Conventional  approaches  to  segaentatlon  (see,  e.g. ,  Ahuja  and 
Rosenfeld  (1982)  for  a  survey  of  image  models)  include  ordinary 
clustering;  edge  detectors  such  as  the  two-dimensional  filters  of 
Irwin  Sobel  or  Judy  Prewitt;  and  a  pixel-labeling  method  that  has 
(perhaps  inaccurately)  been  called  a  relaxation  method.  This  Involves 
updating  the  current  estimate  of 

Pr( label  of  t  is  c I data) , 

the  (conditional)  probability  that  the  true  label  of  pixel  t  is  class  c, 
given  the  data,  by  an  updated  version  of  these  estimated  probabilities, 
where  the  updating  takes  into  account  the  current  labels  of  neighboring 
pixels.  This  is  done  by  means  of  ^"compatibility  coefficients," 
measuring  the  consistency  of  label  c  for  pixel  t  with  the  current 
labelings  of  neighboring  pixels.  (See,  e.g.,  Eklundh,  Yamamoto  and 
Rosenfeld  1980.)  More  precisely,  to  estimate  Pr( label  of  t  is  cl data), 
one  moves  from  stage  s-1  to  stage  s  as  follows.  Let 

Pr(label  of  t  is  c|data;s) 
be  the  s-th  stage  estimate.  Then 

Pr( label  of  t  is  c|data;s)  = 

Pr( label  of  t  is  c)ldata;s-1)  C(t,c;s-1)/C, 
where  C  is  a  normalising  factor  equal  to  the  sum  of  the  denominators 
in  this  expression  and  C(t,c;s-1)  is  the  compatibility  coefficient; 
e.g.,  C(t,c;s-1)  could  be  related  to  estimated  transition  probabilities 
for  the  labels  or  taken  as  the  proportion  of  the  neighbors  of  pixel  t 
which  are  labeled  c  at  stage  s-1.  A  problem  with  this  relaxation 
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■ethod,  noted  by  various  researchers,  is  that  the  laage  gets  good,  then 
"fades,"  so  there  is  a  problem  of  knowing  when  to  stop  the  iteration. 

This  approach  is  intuitively  attractive.  The  approach  I  use  seems 
to  have  the  same  Intuitive  attraction,  plus  the  advantage  of  being 
embedded  in  a  Markov  model  which  gives  a  likelihood  function,  putting 
the  problem  in  the  context  of  parameter  estimation  in  a  probability 
model  and  making  performance  evaluation  possible.  (For  Instance,  one 
can  consider  asking  questions  such  as,  "What  if  we  worked  with 
first-order  neighbors  and  the  model  were  really  second-order?"  I'm  not 
saying  that  such  questions  are  easy  to  answer;  I'm  merely  stating  that 
it  is  only  in  the  presence  of  a  model  that  their  formulation  is  even 
possible.) 

The  idea  of  the  model  in  the  case  of  images  is  the  same  as  for  time 
series,  but  the  transition  matrix  la  more  complicated.  The  transition 
probabilities  are,  even  in  the  simplest,  first-order,  one-sided  case 
(where  one  conditions  only  on  pixels  to  the  north  and  west  of  the  given 
pixel,  rather  than  those  to  the  north,  west,  south,  and  east),  functions 
p((c,d)  to  e)  of  three  arguments,  where  this  represents  the  probability 
of  a  transition  to  class  e  in  pixel  t,  given  that  the  pixel  to  the  north 
of  t  is  class  c  and  the  pixel  to  the  west  of  t  is  class  d.  [The  Markov 
approach  used  here  is  not  unrelated  to  that  presented  by  Professor 
Grenander  at  this  conference  (Grenander  1985;  see  also  Grenander 
1983)*]  The  segmentation  algorithm  discussed  in  the  present  paper,  like 
the  "relaxation"  method  using  compatibility  coefficients,  also  has  the 
property  that  each  Iteration  is  not  necessarily  better  than  the 
preceding,  but  there  is  a  likelihood  value  associated  with  each 
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Iteration,  and  these  values  can  be  used  to  pick  out  the  best  Iteration. 

Here  are  results  of  soae  experiments  with  the  segaentatlon 

algor 1 the. 

The  Fisher  Iris  data  consist  of  4  variables  observed  for  each  of 
150  Irises,  50  in  each  of  three  species.  In  order  to  fore  an 
experlnental  inage,  these  150  were  arranged  into  15  rows  of  10,  the 
first  5  rows  being  species  1,  rows  6-10  being  species  2,  and  rows  11-15 
being  species  3-  Thus  the  true  segaentatlon  looks  like  this. 
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3333333333 

3333333333 

3333333333 


The  neasurenents  on  flowers  50,  100,  and  150  were  used  as  initial 
estleates  for  the  distributional  paraeeters,  naaely,  aeans  of 
■ultlvarlate  noraal  distributions.  A  conaon  covariance  natrlx  was 


assuaed.  (Actually,  statistical  tests  and  aodel-selection  criteria 
suggest  that  different  covariance  aatrlces  should  be  used;  ay  algorithms 
do  allow  for  this.)  The  initial  estlaate  of  the  conaon  covariance  was 


taken  to  be  proportional  to  the  Identity  aatrlx.  Equal  prior 
probabilities  for  the  three  classes  were  assuaed  as  initial  estiaates. 
After  only  three  iterations,  no  label  changed.  The  segaentatlon 
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obtained  was  as  follows. 
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Experiments  were  run  with  different  configurations  of  these  data. 
Below  is  the  true  configuration  for  Experiment  1.  The  four -by- four 
block  of  2's  in  the  middle  can  be  thought  of  as  a  target  to  be  located. 


1111111111 
1111111111 
1111111111 
1  1  1  2  2  2  2  1  1  1 
1  1  1  2  2  2  2  1  1  1 
3332222333 
3332222333 
3333333333 
3333333333 
3333333333 


The  segmentation  produced  was  as  follows. 

11111111 

11111111 

11111111 

11132321 

11132221 
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The  following  was  the  true  configuration  for  Experiment  2.  Note 
the  saall  "target"  of  four  2's. 
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The  segmentation  obtained  was  as  follows. 


1111111111 

1111111111 

1111111111 

1111321111 

3333323323 

3333333333 

3333333333 

3333333333 

3333333333 


Note  the  algoritha  worked  reasonably  well  in  detecting  even  a  saall 
target,  even  with  the  saae  initial  (equal)  estimates  of  prior 
probabilities,  quite  different  from  the  true  probabilities  of  .48,  .04, 
.48  for  the  three  classes.  Thus  perhaps  the  algorithm  will  not  have  to 
be  well  "tuned"  in  order  to  work  well. 

Note  there  are  problems  at  the  northwest  edges,  due  to  the  way  the 
algoritha  is  set  up.  In  future  software  development  I  might  program  a 
procedure  analogous  to  Box  and  Jenkins'  back-forecasting  ("backcasting") 
to  take  care  of  this.  Then,  after  a  first  segmentation,  the  data  would 
be  read  through  in  reverse  order,  so  that  northwest  pixels  become 
southeast  pixels,  easy  to  label  correctly. 
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5.  Model-selection  criteria 

A  question  of  obvious  importance  is  that  of  how  many  classes  of 

segment  to  use,  that  is,  what  should  be  the  value  of  the  parameter  k? 

Maximum  likelihood  estimation  alone  cannot  provide  an  answer,  because  it 

is  model-conditional;  that  is,  maximum  likelihood  applies  to  the  problem 

of  estimating  the  other  parameters,  for  a  fixed  value  of  k.  It  does  not 

apply  to  the  problem  of  estimating  k  because  the  likelihood  Itself 

changes  as  k  does.  An  approach  to  the  problem  of  estimation  of  k  is 

provided  by  model-selection  criteria,  such  as  Akaike's,  Schwarz*  and 

Kashyap's.  These  criteria  add  a  penalty  for  using  extra  parameters  to 

the  negative  log-likelihood  function.  Thus  a  small  score  is  a  good 

score  on  these  criteria.  Akaike's  criterion  is  based  on  a  heuristic 

estimate  of  the  cross-entropy  of  the  true  model  and  the  model  with  k 

classes.  (See  Parzen,  1982.)  Schwarz'  criterion,  perhaps  more 

convincing,  is  based  on  a  Bayesian  approach.  Kashyap  obtains  Schwarz' 

criterion  by  a  relatively  simple  expansion,  then  takes  this  expansion  a 

tern  further  to  obtain  another  criterion,  which  I  call  Kashyap's 

criterion.  The  additional  tern  seems  to  be  particularly  meaningful. 

These  model-selection  criteria  take  the  form 

-2  log  L(k)  ♦  a(n)n(k)  ♦  b(k), 

where  "log"  here  denotes  the  natural  (base-e)  logarithm,  and 

L(k)  =  likelihood  under  the  k-th  model,  maximized 

with  respect  to  the  parameters, 

a(n)  =  2,  for  all  n,  b(k)  =  0,  for  Akaike's  criterion, 

a(n)  *  log  n,  b(k)  =  0,  for  Schwarz'  criterion, 

a(n)  s  log  n,  b(k)  =  logtdet  B(k),  for  Kashyap's  criterion, 

det  =  determinant, 
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and 

B(k)  s  matrix  of  second  partial  derivatives  of  log  L(lc) 
with  respect  to  the  parameters. 

(The  satheaatical  expectation  of  B(k)  is  the  Fisher  information 
matrix.)  The  optimal  k  is  that  value  which  minimizes  the  criterion. 
Akalke's  criterion  generally  chooses  a  higher  value  of  k  (more 
parameters)  than  do  the  others.  Since  for  n  greater  than  8,  log  n  is 
greater  than  2,  Schwarz*  criterion  will  choose  a  value  of  k  no  larger 
than  that  chosen  by  Akalke's,  for  n  greater  than  8. 

6.  Plans  for  further  research 

Several  items  for  future  research  have  already  been  mentioned, 
including  programming  of  back  casting  and  use  of  the  Vlterbi  algorithm. 

Some  other  plans  for  additional  research  will  be  mentioned  now. 
First  there  is  the  matter  of  improved  estimation  of  the  distributional 
parameters.  For  purposes  of  discussion  focus  on  the  example  of 
segmenting  a  time  series  into  two  classes,  l.e. ,  labeling  each 
observation  as  an  "a"  or  a  "b."  There  is  a  truncation  resulting  from 
the  present  implementation  of  the  algorithm.  Namely,  only  those 
observations  labeled  "a"  will  be  used  in  updating  the  current  estimate 
of  the  mean  of  Class  A.  But  these  observations  are  a  truncated  sample 
from  Distribution  A,  and  the  algorithm  does  not  treat  then  as  such.  (At 
the  Conference,  Professor  Tukey  very  kindly  sought  me  out  the  day  after 
ay  presentation  to  point  this  out  to  me,  and  also  to  remark  on  the 
limitations  of  the  unidirectional  approach  of  the  algorithm;  see  below.) 
Rather  than  deal  with  the  truncation  ger  se,  I  had  planned  in  the  next 
stage  of  the  work  to  modify  the  estimators  by  doing  them  Bayes ianly, 
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e.g. ,  estimate  the  eean  of  Group  A  as  N/D,  where 

N  s  x(1)  Pr(alx(1))  +  x(2)  Pr(a|x(2))  ♦  ...  ♦  x(n)  Pr(alx(n)) 
and 

D  s  [Pr(alx(0)  ♦  Pr(alx(2))  +  ...  +  Pr(a|x(n))] 

(In  this  expression,  Pr(alx)  can  be  replaced  by  Pr(xla)  since  Pr(alx)  = 
f(x|a)Pr(a)/f(x)  and  Pr(a)/f(x)  is  a  coaaon  factor  which  will  cancel 
out.)  In  this  estieate,  all  the  observations  play  a  role,  whether 
labeled  as  "a"  or  nb,n  so  that  at  least  soae  of  the  bias  will  be  reaoved 
by  allowing  the  larger  "b"  observations  to  enter. 

I  obtain  the  likelihood  function  by  a  one-sided  approach, 
conditioning  any  given  pixel  on  the  results  in  the  pixels  to  its  north 
and  west.  A  two-sided,  full  neighborhood  approach  seems  preferable  to  a 
unidirectional  one.  The  unidirectional  approach  is  a  device  for  writing 
down  the  likelihood,  but  this  does  not  eean  one  has  to  be  wedded  to  that 
approach  in  the  iterative  updating.  That  is,  the  parameters  can  be 
estimated  with  a  full  neighborhood  approach. 

Another  bit  of  further  research  is  to  calculate  Kashyap's  criterion 
for  various  clustering  and  segmentation  models.  Also,  so  far  the 
method,  algorithm  and  software  have  been  developed  only  for  the  case 
where  the  observations  within  a  class  are  Independent  (and  Gaussian).  A 
next  step  will  be  autoregression  within  classes.  This  is  of  obvious 
importance  in  time  series,  and  in  the  context  of  images,  it  can  be  used 
to  model  textures.  Still  another  generalization  is  to  allow  some  forms 
of  time  or  state  dependency  in  the  transition  probabilities. 
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